Creative Commons License
MSGSU ISTATISTIK BOLUMU - R ILE ISTATISTIKSEL PROGRAMLAMA DERS NOTLARI by is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Hata ve öneriler için emoji::("email") ***

LIBRARIES

.packages = c("dplyr", "hflights", "gapminder","kableExtra")
.inst <- .packages %in% installed.packages()
if(length(.packages[!.inst]) > 0) install.packages(.packages[!.inst])
lapply(.packages, require, character.only=TRUE)

REFERENCES

https://dplyr.tidyverse.org


1 Data manipulation verbs

The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his plyr package. The dplyr package does not provide any “new” functionality to R per se, in the sense that everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.

The dplyr package contains five key data manipulation functions, also called verbs:

verb
select() returns a subset of the columns,
filter() returns a subset of the rows,
arrange() reorders the rows according to single or multiple variables,
mutate() add columns from existing data,
summarise() which reduces each group to a single row by calculating aggregate measures.

Common dplyr Function Properties

All of the functions that we will discuss will have a characteristics.

  • The first argument is a data frame.
  • The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
  • The return result of a function is a new data frame
  • Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

data.frame

library(hflights)
data(hflights)
head(hflights)
str(hflights)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
is.data.frame(hflights)
## [1] TRUE
is.tbl(hflights)
## [1] FALSE

2 select( )

select() allows you to choose the columns of interest out of your data frame.

select(df, var1, var2)

  • You can use : to select a range of variables and - to exclude some variables
  • You can use both variable’s names as well as integer indexes.
select(df, 1:4, -2)
  • select() does not change the data frame; you have to assign the result to a variable to store.
  • select Always returns a tibble.

ex: select first 3 variables of hflights

names(hflights)
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"
head(select(hflights,Year, Month, DayofMonth))
head(select(hflights, 1:3))
head(select(hflights,Year:DayofMonth))

ex:select Airtime and ArrDelay

head(select(hflights, AirTime, ArrDelay))

ex: view first 10 observations of AirTime, ArrDelay and DepDelay

head(select(hflights, AirTime, ArrDelay, DepDelay),10)

ex: view first 10 observations of the variables between Origin and Cancelled

head(select(hflights, Origin:Cancelled),10)

inspect:

head(select(hflights, - (DepTime:AirTime)),10)

2.1 helper functions for variable selection

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

function
starts_with("X") every name that starts with “X”,
ends_with("X") every name that ends with “X”,
contains("X") every name that contains “X”,
matches("X") every name that matches “X”, where “X” can be a regular expression,
num_range("x", 1:5) the variables named x01, x02, x03, x04 and x05,
one_of(x) every name that appears in x, which should be a character vector.

ex: view first 10 observations of ArrDelay and DepDelay

head(select(hflights, ends_with("Delay")),10)

ex: view first 10 observations of UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode

head(select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel")),10)

ex: view first 10 observations of DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay and DepDelay

head(select(hflights, contains("Tim"), contains("Del")),10)

r base and dplyr difference

ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, contains("Taxi"), Distance)
  
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)
  
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, starts_with("T"))

3 filter( )

select rows

filtered.rows <- filter(DATAFRAME, CONDITION-1, CONDITION-2, ..., CONDITION-N)

ex: 3000 ve daha fazla mil yapilan butun ucuslari listeleyin (Distance degiskeni)

filter(hflights, Distance >= 3000)

Q 15: JetBlue, Southwest ya da Delta tarafinda gerceklestirilen tum ucuslari listeleyin (UniqueCarrier degiskeni)

filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))

Q 16: filter() kullanarak TaxiIn ve TaxiOut toplami ucus suresinden(AirTime) fazla olan ucuslari listeleyin

filter(hflights, TaxiIn + TaxiOut > AirTime)

boolean operators in filter():

  • & (and),
  • | (or),
  • ! (not)

Instead of the & operator, you can pass several logical tests to filter(), separated by commas.

select rows

filter(df, a > 0 & b > 0)
filter(df, a > 0, b > 0)

is.na() keeps the observations in df for which the variable x is not NA:
filter(df, !is.na(x))

Q 17: 5:00 am den once (DepTime 500) kalkan veya 10:00 pm den sonra(ArrTime 2200) gelen ucuslari listeleyin

filter(hflights, DepTime < 500 | ArrTime > 2200)

Q 18: Gec kalkan (DepDelay, delay >0) fakat beklenen sureden once inis yapan(ArrDelay >0) ucuslari listeleyin

filter(hflights, DepDelay > 0, ArrDelay < 0)

Q 19: Gec kalktigindan (DepDelay > 0) iptal edilen (Canceled 1 olmali) ucuslari listeleyin

filter(hflights, Cancelled == 1, DepDelay > 0)

Q 20: JFK havaalanina(Dest JFK olmali) inis yapilacak ucuslari c1 olarak kaydedin

c1 <- filter(hflights, Dest == "JFK")

Q 21: c1 e Date isimli yeni degisken ekleyin ve c2 olarak kaydedin. eklenen degisken Yil-Ay-G??n formatinda olsun (Year, Month, DayofMonth)

c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

Q 22: c2 veri setinden srasiyla Date, DepTime, ArrTime ve TailNum degiskenlerini ekrana yazdirin

select(c2, Date, DepTime, ArrTime, TailNum)

How many weekend flights flew a distance of more than 1000 miles but had a total taxiing time below 15 minutes?

3.1 exercise

Use the dataset in package gapminder for the exercises

gapminder::gapminder

3.1.1 ex1

filter(gapminder, country=="Canada", year < 1970)

3.1.2 ex2

filter(gapminder, 
       country == "Canada" |
         country == "Algeria")

3.1.3 ex3

filter(gapminder,
      country %in% c("Canada", "Algeria"))

3.1.4 ex4

filter(gapminder,
       country %in% c("Canada", "Algeria"), year < 1970, year >= 1960)

3.1.5 ex5

filter(gapminder,
       (country == "Canada") |
         (country == "Algeria" & 
            year %in% 1960:1969))

3.1.6 ex6

filter(gapminder, 
       continent != "Europe")

3.1.7 ex7

gapminder %>% 
  filter(country %in% c("Canada", "Algeria"), 
         year <= 1969, year >= 1960) %>% 
  select(country, year, gdpPercap)

4 pipe operator

Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).

The pipe operator takes the thing on the left-hand-side and pipes it into

the function call on the right-hand-side ??? literally, drops it in as the first argument.

head(gapminder)
gapminder %>% head()
select(gapminder, year, lifeExp)
gapminder %>% 
  select(year, lifeExp) %>%
  head(4)

Q1: UniqueCarrier degiskeninde kodlanmis bulunan sirket isimlerinin tam hallerini Carrier degiskeni olarak olusturarak veri setine ekleyelim

lookup table When you subset the lookup table with a character string R will return the values of the lookup table that correspond to the names in the character string.

a <- c("AA", "AS")
b <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
a <- b[a]
a
##         AA         AS 
## "American"   "Alaska"

add a new variable Carrier, with lookup table UniqueCarrier

isim <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

Add the Carrier column to hflights

hflights$Carrier <- isim[hflights$UniqueCarrier]

Glimpse at hflights

dplyr::glimpse(hflights)
## Rows: 227,496
## Columns: 22
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 1443…
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 1554…
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 42…
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "N…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63, …
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44, …
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9, …
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2, -3, …
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA…
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF…
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 22…
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, 12, 8,…
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, 13, 15…
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Carrier           <chr> "American", "American", "American", "American", "Ame…

Q2: Benzer sekilde CancellationCode degiskeni, isim vektoruyle verilen degerler ile Code degiskeni olarak veri setine eklensin. (NA ornegi)

head(unique(hflights$CancellationCode),10) # unik degerleri gosterir
## [1] ""  "A" "B" "C" "D"
isim <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
hflights$Code <- isim[hflights$CancellationCode]
head(hflights)

5 mutate( )

The mutate function allows you to create additional columns for your data frame opposite of select

more.columns <- mutate(DATAFRAME, new.column1 = old.column * 2, new.column2 = old.column * 3 )

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to

create the new variable y: mutate(my_df, x = a + b, y = x + c)

Q9: ActualElapsedTime ile AirTime fark??n?? ActualGroundTime degiskeni olarak hflights a ekleyen g1 isimli data.frame olusturun

g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

Q10: g1 e GroundTime degiskenini ekleyin (TaxiIn ile TaxiOut degiskenlerinin toplami) ve g2 olarak kaydedin. GroundTime ve ActualGroundTime degiskenlerinin ayni oldugunu gozlemleyin.

g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)
head(select(g2, GroundTime, ActualGroundTime),10)

Q11: g2 ye AverageSpeed degiskenini (Distance / AirTime * 60) ekleyin vw g3 olarak kaydedin.

g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

Q12: m1 i hflights a loss (ArrDelay - DepDelay) ve loss_ratio (ratio of loss to DepDelay) degiskenlerini ekleyerek olusturun

m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)

Q13: m2 isimli tblyi 3 degisken ekleyerek olusturun: TotalTaxi: TaxiIn ve TaxiOut toplami ActualGroundTime: ActualElapsedTime ve AirTime farki Diff: yeni olusturulan 2 degiskenin farki Diff degiskeninin t??m degerlerinin 0 oldugunu gorun

m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, 
             ActualGroundTime = ActualElapsedTime - AirTime,
             Diff = TotalTaxi - ActualGroundTime)

transmutate() drops the variable

head(g1)
head(transmute(g1,NewMonth = DayofMonth + 1),10)

5.1 exercise

Use the dataset in package gapminder for the exercises

gapminder::gapminder

5.1.1 ex1

mutate(gapminder, gdp = gdpPercap * pop)
gapminder %>% 
       mutate(gdp = gdpPercap * pop)

5.1.2 ex2

mutate(gapminder, 
       gdp     = gdpPercap * pop, 
       gdpBill = round(gdp/1000000000, 1))
gapminder %>% 
       mutate(gdp = gdpPercap * pop, 
       gdpBill = round(gdp/1000000000, 1))

5.1.3 ex3

mutate(gapminder, cc = paste(country, continent, sep=", "))
gapminder %>% 
       mutate(cc = paste(country, continent, sep=", "))

6 arrange()

Arranging your data, as sorting rows. defaut asc

dtc

dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

Arrange dtc by departure delays

head(arrange(dtc, DepDelay),10)
head(arrange(dtc, CancellationCode),10)
head(arrange(dtc, UniqueCarrier, DepDelay),10)
head(arrange(hflights, UniqueCarrier, desc(DepDelay),10)
head(arrange(hflights, DepDelay + ArrDelay),10)
head(dplyr::arrange(gapminder, year, desc(lifeExp)),10)

ex

gapminder %>% 
  arrange(year, desc(lifeExp)) %>% 
  select(year, lifeExp, everything())

7 summarise()

summarise() (summarize is also accepted) creates a summary of a column,computing a single value from multipe values. use group_by() splits the tibble into parts

#summarize(gapminder, mean_pop=mean(pop), sd_pop=sd(pop))
#gapminder %>% 
#  group_by(country) %>% 
 # head(summarize(mean_pop=mean(pop), sd_pop=sd(pop)))
group_by(gapminder, continent, country, year < 1970)
(out1 <- gapminder %>% 
    group_by(continent, country, year < 1970) %>% 
    summarize(mean_pop=mean(pop), sd_pop=sd(pop)))
out1 %>% 
  summarize(mean_pop=mean(mean_pop))
gapminder %>%
  group_by(country) %>% 
  summarize(mingdp=min(gdpPercap))
gapminder %>%
  group_by(country) %>% 
  summarize(n_distinct(year))
gapminder %>% 
  filter(continent=="Asia") %>% 
  group_by(year) %>% 
  summarize(minexp=min(lifeExp),
            maxexp=max(lifeExp))

8 dplyr join tables

https://stat545.com/join-cheatsheet.html

https://dplyr.tidyverse.org/articles/two-table.html

https://r4ds.had.co.nz/relational-data.html

A key is a column or combination of columns, that occurs in each of the tables that you want to join.

when your key appears in your first table, it is called a primary key, because dplyr treats the first table as your primary table.

when the key appears in the second table, it is called secondary table, or foreign key.

this difference is important, because the key has a special job in the primary table: the primary key should uniquely identify each row in the first dataset

a<-data.frame(x1=c("A", "B","C"), x2= c(1,2,3))
b<-data.frame(x1=c("A","B","D"), x2=c("T", "F", "T"))

8.1 left join

band_members
band_instruments

left_join(a, b): Return all rows from a, and all columns from a and b. If there are multiple matches between a and b, all combination of the matches are returned. This is a mutating join.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(a, b, by="x1")
left_join(a, b, by="x1", suffix=c("_1","_2"))
left_join(band_members, band_instruments, by="name")

8.2 right join

right_join(a,b,by="x1")

8.3 inner join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

inner_join(a,b,by="x1")

8.4 full_join

full_join(a,b,by="x1")

8.5 semi join

semi_join(a,b,by="x1")

8.6 anti_join

anti_join(a,b,by="x1")

9 exercise

How many flights flew to AUS in 2013?

hflights %>%
  filter(Dest == "AUS",Year==2013)

How many flights flew to AUS in first week of January?

# Comma separated conditions are combined with '&'
hflights %>%
 filter(Dest == "AUS", Month == 1, DayofMonth <= 7)

destination to AUS, ORD and TUL

hflights %>%
  filter(Dest == "AUS" | Dest == "ORD" | Dest == "TUL")
hflights %>%
  filter(is.element(Dest, c("AUS", "ORD", "TUL")))

Find longest delayed flights to AUS.

hflights %>%
  filter(Dest == "AUS") %>%
  arrange(desc(DepDelay))

table joining

import sets and themes datasets from legodataset

add theme information to sets table sets n= 11673 themes n = 614

library(readr)
sets<-read_csv("sets.csv")
## Rows: 11673 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): set_num, name
## dbl (3): year, theme_id, num_parts
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
themes<-read_csv("themes.csv")
## Rows: 614 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (2): id, parent_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
a<-sets %>% 
 inner_join(themes, by=c("theme_id" = "id"))
head(a)
a<-sets %>% 
 inner_join(themes, by=c("theme_id" = "id"),   suffix = c("_set", "_theme"))
head(a)
a<-sets %>% 
  inner_join(themes, by=c("theme_id" = "id"), suffix = c("_set", "_theme")) %>%
  count(name_theme, sort = TRUE)

load parts and part_categories tables

parts<-read_csv("parts.csv")
## Rows: 25993 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): part_num, name
## dbl (1): part_cat_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
part_categories<-read_csv("part_categories.csv")
## Rows: 57 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
parts %>% 
    inner_join(part_categories, by = c("part_cat_id" = "id"))
parts %>% 
    inner_join(part_categories, by = c("part_cat_id" = "id"),suffix=c("_part", "_category"))

joining one to many relationships import inventories table

inventories<-read_csv("inventories.csv")
## Rows: 11681 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): set_num
## dbl (2): id, version
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sets %>%
  inner_join(inventories, by="set_num")
sets %>%
  inner_join(inventories, by="set_num")
inventories %>%
  inner_join(sets, by="set_num")
sets %>%
  inner_join(inventories, by="set_num") %>%
  inner_join(themes, by = c( "theme_id" = "id"))
sets %>%
 inner_join(inventories, by="set_num") %>%
  inner_join(themes, by = c( "theme_id" = "id"), suffix = c( "_set", "_theme"))

Creative Commons License
OYKK2020-R ile Veri Önişleme by is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.